
feat(embedder): use summary for file embedding in semantic pipeline#765

Merged
qin-ctx merged 3 commits into volcengine:main from yangxinxin-7:feat/use-summary-for-embedding
Mar 19, 2026

Conversation

@yangxinxin-7 (Collaborator) commented Mar 19, 2026

Summary

  • When files in a code repository are processed through the semantic pipeline, use the pre-generated summary (AST skeleton or LLM summary) for embedding instead of raw file content
  • Add is_code_repo flag to SemanticMsg and propagate it through the pipeline: ResourceProcessor → Summarizer → SemanticMsg → SemanticDagExecutor
  • Detect code repositories via source_format == "repository" (set by CodeRepositoryParser) and pass is_code_repo=True when enqueuing semantic processing
  • use_summary in _file_summary_task is now gated on is_code_repo, so plain text / markdown / other non-repo resources continue to embed raw file content
  • Truncate AST skeleton to max_skeleton_chars (12000 chars, ~3000 tokens) before embedding to prevent oversized input
  • Add max_skeleton_chars config field to SemanticConfig

Why

Raw file content was being sent directly to the embedding API even when a semantic summary had already been generated. For large files this caused the embedding API to reject the request with a token limit error (e.g. OpenAI 8192 token limit). Using the bounded summary instead of raw content fixes this.

However, using summary for all file types (including markdown, plain text) was incorrect — for those files the raw content is the meaningful representation. Summary-based embedding is only appropriate for code files where AST skeletons provide a better semantic signal.

Paths unaffected:

  • index_resource direct indexing path (use_summary defaults to False)
  • Memory files (handled separately in memory_extractor.py)
  • Non-repo resources (markdown, plain text, etc.) — always use raw content
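The detection step is the `source_format == "repository"` check set by CodeRepositoryParser. A hedged sketch of how the flag might be set at enqueue time (the function name `enqueue_semantic_processing` and the dict-based resource shape are illustrative assumptions, not the PR's API):

```python
def enqueue_semantic_processing(resource: dict, queue: list) -> None:
    """Hypothetical enqueue step: tag messages that came from a code repository."""
    # CodeRepositoryParser sets source_format == "repository" on parsed
    # resources; markdown, plain text, etc. get is_code_repo=False and
    # keep the old raw-content embedding behavior.
    is_code_repo = resource.get("source_format") == "repository"
    queue.append({"uri": resource["uri"], "is_code_repo": is_code_repo})
```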

Closes

Closes #616

When files are processed through the semantic pipeline (SemanticDag),
use the pre-generated summary (AST skeleton or LLM summary) for
embedding instead of reading raw file content. This ensures code files,
markdown, and other text files within a repository are indexed by their
semantic summary rather than truncated raw content.

- Add use_summary flag to VectorizeTask, _vectorize_single_file, and vectorize_file
- Set use_summary=True in _file_summary_task when a non-empty summary is available
- Truncate AST skeleton to max_skeleton_chars (12000 chars, ~3000 tokens) before embedding
- Add max_skeleton_chars config field to SemanticConfig
- index_resource and memory paths are unaffected (use_summary defaults to False)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

try:
    if need_vectorize:
        use_summary = bool(summary_dict.get("summary"))
Collaborator

Will only code go through this path?

Collaborator Author

Something seems a bit off here; let me check again.

Collaborator Author

> Will only code go through this path?

This has been updated: now the summary is only used in the code repo case.

yangxinxin-7 and others added 2 commits March 19, 2026 16:55
…ext/doc files

Add `is_code_repo` flag to `SemanticMsg` and propagate it through the
pipeline so that summary-based embedding (AST skeleton) is only applied
when processing a code repository (`source_format == "repository"`).
For plain text, markdown, and other non-repo resources, raw file content
is used for embedding as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@qin-ctx qin-ctx merged commit 59352f8 into volcengine:main Mar 19, 2026
5 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 19, 2026
Successfully merging this pull request may close these issues.

[Bug]: add-resource sends oversized input to OpenAI embeddings API during repo import
